2.1 Data Source
2.3 Target Feature
3.1 Preliminaries
3.2 Data Cleaning and Transformation
Statistical Modeling and Performance Evaluation
5.1 Full Model
5.2 Full Model Diagnostic Checks
5.3 Model 2
5.5 Model 3
5.7 Model 4
5.9 Model 5
The purpose of this study is to build a model to predict diamond price, using attribute data for 53,940 round-cut diamonds. The dataset used is from Dr Vural Aksakalli's GitHub: https://github.com/vaksakalli/datasets/blob/master/diamonds.csv (Aksakalli, 2019).
Dr Aksakalli provides documentation for this dataset at: https://github.com/vaksakalli/datasets/blob/master/github_datasets_desc.ipynb (Aksakalli, 2019). More detailed descriptions of this dataset's features can be found at https://ggplot2.tidyverse.org/reference/diamonds.html (Wickham & Co., 2015).
Dr Aksakalli's dataset appears to be a mirror of the original diamonds dataset from ggplot2, which can be found at https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv (Wickham & Co., 2015).
This report is structured into sections as described below:
We used the diamonds.csv file from Dr Aksakalli's GitHub for this analysis. It contains all of the features described below across a total of 53,940 observations. Extended variable descriptions for this dataset were found in the original ggplot2 diamonds reference documentation.
The objective of this project is to observe whether there is a relationship between the price of a diamond and its features, and in turn attempt to predict the price of a diamond based on those characteristics.
The target feature is price, a continuous numerical feature. Because the target feature is numeric (we are trying to predict price), this is a linear regression problem.
Variable descriptions in diamonds.csv:
carat: Continuous.
cut: Fair, Good, Very Good, Premium, Ideal (5 categories).
color: D, E, F, G, H, I, J (7 categories).
clarity: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF (8 categories).
depth: Continuous.
table: Continuous.
x: Continuous.
y: Continuous.
z: Continuous.
price: Continuous.
The carat, cut and price variables are self-explanatory. The less obvious variables, as described in the ggplot2 reference documentation, are:
color rates the diamond's colour from D (best) to J (worst).
clarity rates the diamond's clarity from I1 (worst) to IF (best).
x is the diamond's length in mm.
y is the diamond's width in mm.
z is the diamond's depth in mm.
depth is the total depth percentage of the diamond = z / mean(x, y) = 2 * z / (x + y).
table is the width of the diamond's flat top facet relative to its widest point.
It is important to note that depth is fundamentally different from z, despite both of them measuring "depth". z measures the absolute top-to-bottom size of the stone in mm, whereas depth (also known as "depth percentage") is expressed as a percentage and measures whether a diamond is overweight or underweight relative to its diameter.
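The depth formula above can be sanity-checked numerically. The snippet below is a hypothetical illustration (the helper name and dimensions are made up, not taken from the dataset):

```python
# Recompute depth percentage from the linear dimensions:
# depth = 2 * z / (x + y), expressed as a percentage.
def depth_pct(x: float, y: float, z: float) -> float:
    return 200.0 * z / (x + y)

# Illustrative dimensions in mm (invented for this example)
print(round(depth_pct(3.95, 3.98, 2.43), 1))  # -> 61.3
```

On real data, comparing this recomputed value against the recorded depth column would flag rows where the dimensions and the depth percentage disagree.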
We begin by importing the required modules, setting global display options, and assigning the constant SAMPLE_SIZE.
# Import required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import warnings
import patsy
from IPython.display import display, HTML
warnings.filterwarnings('ignore')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("seaborn-white")
sns.set(style="dark", color_codes=True)
SAMPLE_SIZE = 10000
Next, we import the data (.csv file) from its source URL and create a DataFrame. The .csv already has a header, so no need to specify column (attribute) names.
url = ('https://raw.githubusercontent.com/vaksakalli/datasets/master/diamonds.csv')
diamond_data = pd.read_csv(url, sep = ',', header = 0)
diamond_data.sample(15, random_state=99)
We now check the data types of each feature, verifying that they match the descriptions in the ggplot2 documentation. As this dataset contains 53,940 observations, for ease of computation we will work with a random sample of 10,000 observations.
# Subset original dataset
diamond_data_sample = diamond_data.sample(n=SAMPLE_SIZE, random_state=479)
# Print shape and data types
print("The shape of this sampled dataset:", diamond_data_sample.shape)
print("Data types for each feature shown below with 'object' being a string type.\n")
print(diamond_data_sample.dtypes)
We can see there are a total of 7 numeric and 3 categorical features.
print("Number of missing values for each feature:\n")
print(diamond_data_sample.isnull().sum())
There are no NaN values for any attribute. We will later check for any incorrectly named or labelled values.
# Use int64 and float64 types to include all continuous variables
display(HTML('<b>Table 1: Summary of continuous features</b>'))
diamond_data_sample.describe(include=['int64', 'float64'])
# Use object type to include all categorical variables
display(HTML('<b>Table 2: Summary of categorical features</b>'))
diamond_data_sample.describe(include='object')
Table 2 shows that cut, color and clarity have 5, 7 and 8 categories respectively, matching the totals described previously in Overview - Descriptive Features. All categorical observations are therefore correctly labelled (no cardinality issues), and we do not need to fix or amend the data for these variables.
All continuous variables in our dataset would intuitively appear to have predictive power (an effect on price), so we will leave them all as is. depth is derived directly from x, y and z, and table describes a closely related proportion of the stone's geometry. At this stage we can't tell whether removing depth and table will affect our predictions, so let's leave them in for the full model.
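Since x, y and z all describe the stone's size, one way to see how tightly they (and carat) move together is a correlation matrix. The sketch below uses a small synthetic stand-in with the same column names; on the report's data this would simply be `diamond_data_sample[['x', 'y', 'z', 'carat']].corr()`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the size-related features (invented values)
rng = np.random.default_rng(0)
x = rng.uniform(4.0, 9.0, size=200)              # lengths in mm
demo = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(0.0, 0.05, 200),         # width tracks length closely
    "z": 0.62 * x + rng.normal(0.0, 0.05, 200),  # depth roughly proportional
    "carat": (x / 6.4) ** 3,                     # mass grows with volume
})
corr = demo.corr()
print(corr.round(2))  # x, y, z and carat are all strongly correlated
```

Very high pairwise correlations among these features would be one reason why dropping some of them later costs a model very little explanatory power.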
We examine the unique values of each of the categorical variables.
cat_vars = ['cut', 'clarity', 'color']
for var in cat_vars:
print("Unique values for", var)
print(diamond_data_sample[var].unique(), "\n")
Fortunately, our variables are already formatted nicely, and do not require any further cleanup.
diamond_data_sample.head()
Much the same as our categorical variables, our column names are already formatted nicely, and do not require any further cleanup.
We decided to explore the price of diamonds using a line plot, boxplot, and histogram.
Line plot:
# create, format and display a line plot
plt.figure(figsize=(18,6))
p1 = diamond_data_sample['price'].value_counts().sort_index().plot(kind="line", color="royalblue")
plt.title("Figure 1: Line chart of Diamond price", fontsize = 16)
plt.xlabel('Price (USD)', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.show()
The line plot above shows the prices of the diamonds in our random sample and the number of times each occurs. As can be seen, diamonds are more commonly priced at the lower end of their range, around 1000 USD.
Boxplot for diamond price:
# create, format and display a boxplot
plt.figure(figsize=(18,6))
p2 = sns.boxplot(diamond_data_sample['price'], palette="Blues").set_title('Figure 2: Box Plot of Diamond Price', fontsize = 16)
plt.xlabel('Price (USD)', fontsize=16)
plt.show();
We can see from the boxplot (Figure 2) that the price feature is considerably right-skewed. The left whisker of the boxplot is equal to the minimum price while the right whisker is roughly at 12000, hence all diamond prices above 12000 USD can be considered outliers and therefore very rare.
It can also be observed that 50% of diamonds will be priced between roughly 900 USD and 5400 USD, with a median of 2500 USD.
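The quartile figures quoted above can be read off directly with pandas' `quantile`; a minimal sketch on a made-up price series (on the report's data it would be `diamond_data_sample['price'].quantile([0.25, 0.5, 0.75])`):

```python
import pandas as pd

# Made-up prices for illustration only
prices = pd.Series([500, 900, 1800, 2500, 3200, 5400, 12000])
print(prices.quantile([0.25, 0.50, 0.75]))
```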
Histogram for diamond price:
# create, format and display a histogram
plt.figure(figsize=(18,6))
p3 = sns.distplot(diamond_data_sample['price'], kde=True, bins=18).set_title('Figure 3: Histogram of Diamond Price', fontsize = 16)
plt.xlabel('Price (USD)', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.show();
As we could already see in Figure 2, Figure 3 also demonstrates price's right-skewness, having a long right tail. The histogram shows the distribution of diamond prices, with each bin spanning roughly 1000 USD and starting at roughly 350 USD. It can easily be seen that the majority of diamonds fall within roughly 350-5000 USD.
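The skew visible in the histogram can also be quantified rather than just eyeballed: pandas' `skew()` returns a positive value for a right-skewed sample. A sketch with invented prices (on the real data: `diamond_data_sample['price'].skew()`):

```python
import pandas as pd

# Invented prices with a long right tail, for illustration
prices = pd.Series([350, 600, 800, 1000, 1500, 2500, 5000, 12000, 18000])
print(round(prices.skew(), 2))  # positive value => right-skewed
```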
Here we use pair plots to visualise each variable and how it relates to, and affects, price.
Pair plots of Numeric features and Price
sns.set()
g = sns.pairplot(diamond_data_sample,height=6,x_vars=["carat"],y_vars=["price"])
g.fig.set_size_inches(18,6)
plt.title("Figure 4: Relationship Between Carat and Price", fontsize = 16)
plt.show()
It can be clearly seen from Figure 4 that as a diamond's carat increases, so does its price. Although a little difficult to tell, price most likely increases faster than linearly with carat.
#Plot of Depth and Price
sns.set()
g = sns.pairplot(diamond_data_sample,height=6,x_vars=["depth"],y_vars=["price"])
g.fig.set_size_inches(18,6)
plt.title("Figure 5: Relationship Between Depth % and Price", fontsize = 16)
plt.show()
Figure 5 shows no clear relationship between depth % and price (depth % is the diamond's depth divided by its average diameter). Of the characteristics we are analysing, depth % appears to have the least effect on price. If all other characteristics were held identical and only depth % varied, perhaps a clearer relationship would emerge.
#Plot of Table and Price
sns.set()
g = sns.pairplot(diamond_data_sample,height=6,x_vars=["table"],y_vars=["price"])
g.fig.set_size_inches(18,6)
plt.title("Figure 6: Relationship Between Table and Price", fontsize = 16)
plt.show()
As with Figure 5, Figure 6 shows no clear relationship between price and the diamond's table % (the width of the diamond's table relative to its widest point).
#Plot of X and Price
sns.set()
g = sns.pairplot(diamond_data_sample,height=6,x_vars=["x"],y_vars=["price"])
g.fig.set_size_inches(18,6)
plt.title("Figure 7: Relationship Between Diamond X (mm) and Price", fontsize = 16)
plt.show()
#Plot of Y and Price
sns.set()
g = sns.pairplot(diamond_data_sample,height=6,x_vars=["y"],y_vars=["price"])
g.set(xlim=(-1,11)) #Set limit on x-axis as there is a single point around x=60
g.fig.set_size_inches(18,6)
plt.title("Figure 8: Relationship Between Diamond Y (mm) and Price", fontsize = 16)
plt.show()
#Plot of Z and Price
sns.set()
g = sns.pairplot(diamond_data_sample,height=6,x_vars=["z"],y_vars=["price"])
g.fig.set_size_inches(18,6)
plt.title("Figure 9: Relationship Between Diamond Z (mm) and Price", fontsize = 16)
plt.show()
As can be seen in Figures 7, 8 and 9, the price of a diamond is significantly affected by its dimensions, with the figures corresponding to the diamond's x, y and z dimensions respectively. Price clearly increases faster than linearly as any dimension increases.
#Plot of Clarity and Price
plt.figure(figsize=(18,6))
sns.boxplot(diamond_data_sample['clarity'],diamond_data_sample['price']);
plt.title('Figure 10: Relationship between Clarity and Price',fontsize = 16);
plt.show()
Figure 10 shows the relationship between a diamond's clarity and its price. From these boxplots it can be observed that clarity has a noticeable effect on price. The clarities VVS1 and IF appear to have the lowest price ranges in this dataset, while VS1, VS2 and SI2 tend to be priced higher than the rest.
#Plot of Cut and Price
plt.figure(figsize=(18,6))
sns.boxplot(diamond_data_sample['cut'],diamond_data_sample['price']);
plt.title('Figure 11: Relationship between Cut and Price',fontsize = 16);
plt.show();
Figure 11 shows the relationship between the cut of a diamond and its price. From our data, the cut of the diamond does have some effect on price, with some variation in prices across cuts. Although there is not much difference between the boxplots, overall the premium cut reaches the highest prices.
#Plot of Color and Price
plt.figure(figsize=(18,6))
sns.boxplot(diamond_data_sample['color'],diamond_data_sample['price']);
plt.title('Figure 12: Relationship between Color and Price',fontsize = 16);
plt.show()
Figure 12 shows the relationship between a diamond's color and its price. From the boxplots, there is notable variation in price depending on color. Color doesn't appear to have much effect on the lower price range, but does have a significant effect at higher prices. J and I appear to reach the highest prices, with I coming out on top.
#Plot of Cut, Color and Price
plt.figure(figsize=(18,6))
sns.boxplot(diamond_data_sample['cut'],diamond_data_sample['price'],
hue = diamond_data_sample['color'], width = 0.8);
plt.title('Figure 13: Relationship between Cut, Color and Price',fontsize= 16);
plt.show();
Combining Figures 11 and 12, Figure 13 shows the relationship between a diamond's cut, color and price in more detail.
Different colors command different prices at different cuts, with the premium cut reaching the highest overall prices and the fair cut having the smallest range of prices.
# Plot of Clarity, Carat and Price
plt.figure(figsize=(18,6))
sns.scatterplot(diamond_data_sample['carat'], diamond_data_sample['price'], hue = diamond_data_sample['clarity'])
plt.title('Figure 14: Relationship between Clarity, Carat and Price', fontsize = 16);
plt.legend(loc = 'upper right')
plt.show();
Figure 14 takes the relationship between clarity and price from Figure 10 and adds carat, giving a more detailed look at the effect clarity and carat together have on price.
The scatter plot above shows the steep carat-price relationship from Figure 4 divided into the 8 clarity grades. From the plot, it appears that I1 prices rise least as carat increases, while IF and VVS1 rise the most.
We fit a full linear regression model (using all available features) to begin with. As noted previously, depth (and, to a lesser extent, table) derives from the stone's dimensions x, y and z, but let's leave them in for now; we can remove them in a later model.
diamond_data_sample.head()
# build the formula string from feature names, sans target feature
formula_string_indep_vars = ' + '.join(diamond_data_sample.drop(columns='price').columns)
formula_string = 'price ~ ' + formula_string_indep_vars
# print the formula string (sanity check)
print('formula_string: ', formula_string)
# create model from sampled dataset
full_model = sm.formula.ols(formula=formula_string, data=diamond_data_sample)
# fit the model
fitted_full_model = full_model.fit()
# print a summary of the model
print(fitted_full_model.summary(), "\n")
print(f"Regression number of terms: {len(fitted_full_model.model.exog_names)}")
print(f"Regression F-distribution p-value: {fitted_full_model.f_pvalue}")
print(f"Regression R-squared: {fitted_full_model.rsquared:.3f}")
print(f"Regression Adjusted R-squared: {fitted_full_model.rsquared_adj:.3f}")
Now that we have our fitted full model, let's plot its predictions against the actual values to provide a visual representation of its accuracy.
def plot_line(axis, slope, intercept, **kargs):
xmin, xmax = axis.get_xlim()
plt.plot([xmin, xmax], [xmin*slope+intercept, xmax*slope+intercept], **kargs)
# Create a scatter plot for Price vs Predicted price
plt.figure(figsize=(18,6))
plt.scatter(diamond_data_sample['price'], fitted_full_model.fittedvalues, alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="blue");
plt.xlabel('Price');
plt.ylabel('Predicted Price');
plt.title('Figure 15: Scatterplot of price vs predicted price for Model 1', fontsize=16);
plt.show();
We found our full model's adjusted R-squared to be 0.925, indicating that about 92.5% of the variance in price is explained by the model.
We observe from Figure 15 that the model tends to under-price diamonds from around the 10,000 USD point onwards, with considerably more observations falling below the regression line (i.e. predicted below actual) than above.
We noted earlier, from Figures 2 and 3, that price is right-skewed. From the p-values, we ascertain that both the y and z features are insignificant at the 5% level, with y having the higher p-value.
# Create scatterplot of residuals
plt.figure(figsize=(18,6))
sns.residplot(x=diamond_data_sample['price'], y=fitted_full_model.fittedvalues);
plt.ylabel('Residuals')
plt.title('Figure 16: Scatterplot of Residuals for Model 1', fontsize=16)
plt.show();
In Figure 16 we observe the residuals to be primarily clustered around 0, though with some spread towards -5000 (y axis) in the 0-5000 (x axis) price range. From the 7500 dollar mark onwards to 17500 (x axis), residuals spread down towards -10000 (y axis). There appear to be a few extreme outlier residuals at the uppermost bounds of the plot (y: 17500-20000, x: 15000-17500).
# Create a histogram of residuals
residuals = diamond_data_sample['price'] - fitted_full_model.fittedvalues
plt.figure(figsize=(18,6))
plt.hist(residuals, bins = 100);
plt.xlabel('residual');
plt.ylabel('frequency');
plt.title('Figure 17: Histogram of Residuals for Model 1', fontsize=16);
plt.show();
Our residuals histogram in Figure 17 looks relatively symmetric with a slight right-skew. This indicates that the residuals somewhat approximate a normal distribution.
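The visual normality check can be supplemented with a formal test such as Jarque-Bera (here via `scipy.stats.jarque_bera`; on the real fit the input would be `fitted_full_model.resid`). The residuals below are synthetic stand-ins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
resid = rng.normal(0.0, 1500.0, size=1000)  # well-behaved stand-in residuals
jb_stat, jb_p = stats.jarque_bera(resid)
print(round(jb_stat, 2), round(jb_p, 3))
# A small statistic / large p-value is consistent with normal residuals;
# heavy skew in real residuals would push the statistic up sharply.
```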
From the full model, we now remove y, as it had the highest p-value in the model summary.
# remove the y feature as it has the highest p-value
# build the formula string from feature names, sans target feature and y
formula_string_indep_vars_v2 = ' + '.join(diamond_data_sample.drop(columns=['price','y']).columns)
formula_string_v2 = 'price ~ ' + formula_string_indep_vars_v2
# print the formula string (sanity check)
print('formula_string: ', formula_string_v2)
# create model from sampled dataset
full_model_2 = sm.formula.ols(formula=formula_string_v2, data=diamond_data_sample)
# fit the model
fitted_full_model_2 = full_model_2.fit()
# print a summary of the model
print(fitted_full_model_2.summary(), "\n")
print(f"Regression number of terms: {len(fitted_full_model_2.model.exog_names)}")
print(f"Regression F-distribution p-value: {fitted_full_model_2.f_pvalue}")
print(f"Regression R-squared: {fitted_full_model_2.rsquared:.3f}")
print(f"Regression Adjusted R-squared: {fitted_full_model_2.rsquared_adj:.3f}")
# Creating scatter plot
plt.figure(figsize=(18,6))
plt.scatter(diamond_data_sample['price'], fitted_full_model_2.fittedvalues, alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="blue");
plt.xlabel('Price');
plt.ylabel('Predicted Price');
plt.title('Figure 18: Scatterplot of price against predicted price for Model 2', fontsize=16);
plt.show();
This model returns an adjusted R-squared of 0.925, meaning that roughly 92.5% of the variance can still be explained by the reduced model, with one fewer variable. Looking at the p-values, all features are significant at the 5% level except z, which has a p-value of 0.143 and is therefore insignificant.
Although both y and z had p-values greater than 0.05, only y was removed for this model, as its p-value was the highest.
# Create scatterplot of residuals
plt.figure(figsize=(18,6))
sns.residplot(x=diamond_data_sample['price'], y=fitted_full_model_2.fittedvalues);
plt.ylabel('Residuals')
plt.title('Figure 19: Scatterplot of Residuals for Model 2', fontsize=16)
plt.show();
In Figure 19 we observe the residuals to be primarily clustered around 0, though with some spread towards -5000 (y axis) in the 0-5000 (x axis) price range. From the 7500 dollar mark onwards to 17500 (x axis), residuals spread down towards -10000 (y axis). There appear to be a few extreme outlier residuals at the uppermost bounds of the plot (y: 15000 - 20000+, x: 6000-17500).
residuals2 = diamond_data_sample['price'] - fitted_full_model_2.fittedvalues
plt.figure(figsize=(18,6))
plt.hist(residuals2, bins = 100);
plt.xlabel('residual');
plt.ylabel('frequency');
plt.title('Figure 20: Histogram of residuals for Model 2', fontsize = 16)
plt.show();
Our residuals histogram in Figure 20 again looks relatively symmetric with a slight right-skew, indicating that the residuals somewhat approximate a normal distribution.
Continuing on from the previous model, we then removed z as it was the only other variable with a p-value greater than 0.05.
# remove the z feature, from the previous model, as it has the next highest p-value
# build the formula string from feature names, sans target feature, y and z
formula_string_indep_vars_v3 = ' + '.join(diamond_data_sample.drop(columns=['price','y','z']).columns)
formula_string_v3 = 'price ~ ' + formula_string_indep_vars_v3
# print the formula string (sanity check)
print('formula_string: ', formula_string_v3)
# create model from sampled dataset
full_model_3 = sm.formula.ols(formula=formula_string_v3, data=diamond_data_sample)
# fit the model
fitted_full_model_3 = full_model_3.fit()
# print a summary of the model
print(fitted_full_model_3.summary(), "\n")
print(f"Regression number of terms: {len(fitted_full_model_3.model.exog_names)}")
print(f"Regression F-distribution p-value: {fitted_full_model_3.f_pvalue}")
print(f"Regression R-squared: {fitted_full_model_3.rsquared:.3f}")
print(f"Regression Adjusted R-squared: {fitted_full_model_3.rsquared_adj:.3f}")
# Create a scatter plot for Price vs Predicted price
plt.figure(figsize=(18,6))
plt.scatter(diamond_data_sample['price'], fitted_full_model_3.fittedvalues, alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="blue");
plt.xlabel('Price');
plt.ylabel('Predicted Price');
plt.title('Figure 21: Scatterplot of price vs predicted price for Model 3', fontsize=16);
plt.show();
As with the previous model, after removing z in addition to y, the adjusted R-squared remains unchanged at 0.925, meaning that roughly 92.5% of the variance can still be explained by the reduced model with two fewer variables.
With y already removed, z was the only other variable with a p-value greater than 0.05 and was therefore removed for this model.
# Create scatterplot of residuals
plt.figure(figsize=(18,6))
sns.residplot(x=diamond_data_sample['price'], y=fitted_full_model_3.fittedvalues);
plt.ylabel('Residuals')
plt.title('Figure 22: Scatterplot of Residuals for Model 3', fontsize=16)
plt.show();
In Figure 22 we observe the residuals to be primarily clustered around 0, though with some spread towards -5000 (y axis) in the 0-5000 (x axis) price range. From the 5000 dollar mark onwards to 17500 (x axis), residuals spread down towards -10000 (y axis). There appear to be a few extreme outlier residuals at the uppermost bounds of the plot (y: 15000 - 20000+, x: 6000-17500).
# Create a histogram of residuals
residuals = diamond_data_sample['price'] - fitted_full_model_3.fittedvalues
plt.figure(figsize=(18,6))
plt.hist(residuals, bins = 100);
plt.xlabel('residual');
plt.ylabel('frequency');
plt.title('Figure 23: Histogram of Residuals for Model 3', fontsize=16);
plt.show();
Our residuals histogram in Figure 23 yet again looks symmetric with a slight right-skew, indicating that the residuals somewhat approximate a normal distribution.
Next we decided to remove both table and depth from the previous model, since depth is derived directly from x, y and z, and table describes a closely related proportion of the stone. We removed them one at a time, starting with table.
# remove the Table feature as per comments throughout (because it is derivative of X, Y, Z)
# build the formula string from feature names, sans target feature, y, z and table
formula_string_indep_vars_v4 = ' + '.join(diamond_data_sample.drop(columns=['price','y','z','table']).columns)
formula_string_v4 = 'price ~ ' + formula_string_indep_vars_v4
# print the formula string (sanity check)
print('formula_string: ', formula_string_v4)
# create model from sampled dataset
full_model_4 = sm.formula.ols(formula=formula_string_v4, data=diamond_data_sample)
# fit the model
fitted_full_model_4 = full_model_4.fit()
# print a summary of the model
print(fitted_full_model_4.summary(), "\n")
print(f"Regression number of terms: {len(fitted_full_model_4.model.exog_names)}")
print(f"Regression F-distribution p-value: {fitted_full_model_4.f_pvalue}")
print(f"Regression R-squared: {fitted_full_model_4.rsquared:.3f}")
print(f"Regression Adjusted R-squared: {fitted_full_model_4.rsquared_adj:.3f}")
# Create a scatter plot for Price vs Predicted price
plt.figure(figsize=(18,6))
plt.scatter(diamond_data_sample['price'], fitted_full_model_4.fittedvalues, alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="blue");
plt.xlabel('Price');
plt.ylabel('Predicted Price');
plt.title('Figure 24: Scatterplot of price vs predicted price for Model 4', fontsize=16);
plt.show();
After removing table from the model, the adjusted R-squared remains unchanged at 0.925, meaning that roughly 92.5% of the variance can still be explained by the reduced model, now with three fewer variables.
This also suggests that, as we suspected, table (and most likely depth as well) contributes little to the model.
# Create scatterplot of residuals
plt.figure(figsize=(18,6))
sns.residplot(x=diamond_data_sample['price'], y=fitted_full_model_4.fittedvalues);
plt.ylabel('Residuals')
plt.title('Figure 25: Scatterplot of Residuals for Model 4', fontsize=16)
plt.show();
In Figure 25 we observe the residuals to be primarily clustered around 0, though with some spread towards -5000 (y axis) in the 0-5000 (x axis) price range. From the 7500 dollar mark onwards to 17500 (x axis), residuals spread down towards -10000 (y axis). There appear to be a few extreme outlier residuals at the uppermost bounds of the plot (y: 15000 - 20000+, x: 6000-17500).
# Create a histogram of residuals
residuals = diamond_data_sample['price'] - fitted_full_model_4.fittedvalues
plt.figure(figsize=(18,6))
plt.hist(residuals, bins = 100);
plt.xlabel('residual');
plt.ylabel('frequency');
plt.title('Figure 26: Histogram of Residuals for Model 4', fontsize=16);
plt.show();
After removing table, our residuals histogram in Figure 26 still looks symmetric with a slight right-skew. This indicates that the residuals somewhat approximate a normal distribution.
Finally, as mentioned previously, we removed depth from the model, as we were quite confident it would have little effect.
# remove the Depth feature as per comments throughout (because like table, it is derivative of X, Y, Z)
# build the formula string from feature names, sans target feature, y, z, table and depth
formula_string_indep_vars_v5 = ' + '.join(diamond_data_sample.drop(columns=['price','y','z','table','depth']).columns)
formula_string_v5 = 'price ~ ' + formula_string_indep_vars_v5
# print the formula string (sanity check)
print('formula_string: ', formula_string_v5)
# create model from sampled dataset
full_model_5 = sm.formula.ols(formula=formula_string_v5, data=diamond_data_sample)
# fit the model
fitted_full_model_5 = full_model_5.fit()
# print a summary of the model
print(fitted_full_model_5.summary(), "\n")
print(f"Regression number of terms: {len(fitted_full_model_5.model.exog_names)}")
print(f"Regression F-distribution p-value: {fitted_full_model_5.f_pvalue}")
print(f"Regression R-squared: {fitted_full_model_5.rsquared:.3f}")
print(f"Regression Adjusted R-squared: {fitted_full_model_5.rsquared_adj:.3f}")
# Create a scatter plot for Price vs Predicted price
plt.figure(figsize=(18,6))
plt.scatter(diamond_data_sample['price'], fitted_full_model_5.fittedvalues, alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="blue");
plt.xlabel('Price');
plt.ylabel('Predicted Price');
plt.title('Figure 27: Scatterplot of price vs predicted price for Model 5', fontsize=16);
plt.show();
After removing depth from the model, a small change in the adjusted R-squared occurred, bringing it down from 0.925 to 0.924. The change, however, is negligible: roughly 92% of the variance is still explained by the reduced model, now with only 5 of its original 9 explanatory variables.
As with table, we were fairly confident that removing depth would leave the model largely unaffected, and the results above bear this out.
# Create scatterplot of residuals
plt.figure(figsize=(18,6))
sns.residplot(x=diamond_data_sample['price'], y=fitted_full_model_5.fittedvalues);
plt.ylabel('Residuals')
plt.title('Figure 28: Scatterplot of Residuals for Model 5', fontsize=16)
plt.show();
In Figure 28 we observe the residuals to be primarily clustered around 0, though with some spread towards -5000 (y axis) in the 0-5000 (x axis) price range. From the 7500 dollar mark onwards to 17500 (x axis), residuals spread down towards -10000 (y axis). There appear to be a few extreme outlier residuals at the uppermost bounds of the plot (y: 15000 - 20000+, x: 6000-17500).
# Create a histogram of residuals
residuals = diamond_data_sample['price'] - fitted_full_model_5.fittedvalues
plt.figure(figsize=(18,6))
plt.hist(residuals, bins = 100);
plt.xlabel('residual');
plt.ylabel('frequency');
plt.title('Figure 29: Histogram of Residuals for Model 5', fontsize=16);
plt.show();
Although there is a slight change now, our residuals histogram in Figure 29 still looks roughly symmetric with a mild right-skew, indicating that the residuals somewhat approximate a normal distribution.
Of the five linear regression models tested, the best (fewest features with essentially undiminished predictive power) is Model 5, which uses only carat, cut, color, clarity and x. These were the only features to demonstrate statistical significance at conventional levels, as evidenced by the adjusted R-squared holding steady at 0.925 across Models 2, 3 and 4 while each of the other features was removed in turn.
To reinforce this further, the only way to introduce a substantial change in fit was to remove one of the features remaining in Model 5. We did not consider the 0.001 difference in adjusted R-squared between Model 5 and the other models to be meaningful, and so disregarded it.
Our best regression model (Model 5) predicts price reasonably accurately, to within roughly ±5000 USD, up to about the 15,000 USD point, from which it tends to slightly under-price, with more observations falling below the regression line (predicted below actual) than above from that point on.
From the x, y and z vs price plots (Figures 7, 8 and 9) we note that the x feature in Model 5 could likely be replaced with either y or z, as the three plots are nearly indistinguishable and these features represent very similar physical properties of each diamond (its dimension along each axis).
Thus, we have found that there is a significant relationship between the price of a diamond and the features in Model 5, and that we can predict the price of a given diamond best utilising Model 5's feature set.
Fisher, F. (2013). Diamond Depth Percentages In Round Brilliant Cut Diamonds [online]. Available at http://findmyrock.com/diamond-basics/diamond-cut/understanding-diamond-depth-percentages/ [Accessed 2019-11-09]
Aksakalli, V. (2019). datasets/diamonds.csv [online]. Available at https://github.com/vaksakalli/datasets/blob/master/diamonds.csv [Accessed 2019-11-09]
Wickham, H, Chang, W, Henry, L, Pedersen, TL, Takahashi, K, Wilke, C, Woo, K, Yutani, H. (2015) Prices of 50,000 round cut diamonds [online]. Available at https://ggplot2.tidyverse.org/reference/diamonds.html [Accessed 2019-11-09]
Wickham, H, Chang, W, Henry, L, Pedersen, TL, Takahashi, K, Wilke, C, Woo, K, Yutani, H. (2015) diamonds.csv [online]. Available at https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv [Accessed 2019-11-09]